graph LR
A["Install<br/>llama.cpp"] --> B["Obtain GGUF<br/>model"]
B --> C["CLI inference<br/>or API server"]
C --> D["Query from<br/>Python / curl"]
D --> E["Deploy to<br/>production"]
style A fill:#ffce67,stroke:#333
style B fill:#ffce67,stroke:#333
style C fill:#6cc3d5,stroke:#333,color:#fff
style D fill:#6cc3d5,stroke:#333,color:#fff
style E fill:#56cc9d,stroke:#333,color:#fff
Deploying and Serving LLMs with llama.cpp
End-to-end guide: deploy and serve LLMs locally and at scale with llama.cpp for efficient CPU/GPU inference
Keywords: llama.cpp, LLM serving, model deployment, GGUF, quantization, inference optimization, CPU inference, GPU inference, OpenAI API, production LLM

Introduction
Not every LLM deployment requires a high-end GPU cluster. Many real-world use cases — edge devices, local development, cost-sensitive production — demand efficient inference on commodity hardware.
llama.cpp is an open-source C/C++ inference engine that makes LLM deployment:
- Hardware-flexible (runs on CPU, GPU, Apple Silicon, and hybrid CPU+GPU)
- Memory-efficient (via aggressive quantization — 2-bit to 8-bit GGUF formats)
- Production-ready (built-in OpenAI-compatible API server)
- Portable (no Python runtime or CUDA toolkit required at inference time)
- Easy to deploy (single binary, Docker, embedded systems)
In this tutorial, we will walk through a complete pipeline:
- Install and build llama.cpp
- Obtain and quantize models in GGUF format
- Run offline inference from the CLI
- Serve a model with the OpenAI-compatible API server
- Query the model from Python
- Optimize for production deployment
What is llama.cpp?
llama.cpp is a high-performance C/C++ inference engine for LLMs, originally created by Georgi Gerganov. Key features include:
- GGUF format: Purpose-built model format with embedded metadata, supporting a wide range of quantization levels
- Quantization: Reduce model size and memory usage with minimal quality loss (Q2_K through Q8_0)
- CPU optimization: AVX, AVX2, AVX-512, ARM NEON for fast inference without a GPU
- GPU offloading: Offload layers to NVIDIA (CUDA), AMD (ROCm), Apple Metal, or Vulkan GPUs
- Built-in server: OpenAI-compatible HTTP API with continuous batching
- Support for many models: Llama, Mistral, Qwen, Phi, Gemma, DeepSeek, and more
graph TD
A["llama.cpp"] --> B["GGUF Format<br/>Embedded metadata"]
A --> C["Quantization<br/>Q2_K to Q8_0"]
A --> D["CPU Optimized<br/>AVX / AVX2 / NEON"]
A --> E["GPU Offloading<br/>CUDA / Metal / Vulkan"]
A --> F["Built-in Server<br/>OpenAI-compatible"]
style A fill:#56cc9d,stroke:#333,color:#fff
style B fill:#6cc3d5,stroke:#333,color:#fff
style C fill:#6cc3d5,stroke:#333,color:#fff
style D fill:#6cc3d5,stroke:#333,color:#fff
style E fill:#6cc3d5,stroke:#333,color:#fff
style F fill:#6cc3d5,stroke:#333,color:#fff
Hardware Requirements
llama.cpp is designed to run on a wide range of hardware, from laptops to servers:
| Model Size | Quantization | RAM/VRAM Needed | Recommended Hardware |
|---|---|---|---|
| 0.5B–3B | Q4_K_M | 2–3 GB | Any modern CPU / Raspberry Pi 5 |
| 7B–8B | Q4_K_M | 5–6 GB | 16 GB RAM laptop / RTX 3060 |
| 13B | Q4_K_M | 9–10 GB | 32 GB RAM / RTX 4090 |
| 70B | Q4_K_M | 40–45 GB | 64 GB RAM / A100 (multi-GPU) |
Key advantage: llama.cpp can run entirely on CPU, making it ideal for environments without GPUs.
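The RAM/VRAM figures above can be sanity-checked with a back-of-the-envelope formula: weights take roughly parameters × bits-per-weight / 8 bytes, plus headroom for the KV cache and runtime buffers. A minimal sketch (the ~4.5 effective bits for Q4_K_M and the 20% overhead factor are rough assumptions, not llama.cpp constants):

```python
def estimate_memory_gb(n_params_billion, bits_per_weight, overhead=1.2):
    """Rough RAM/VRAM estimate for a quantized model.

    Weights cost n_params * bits / 8 bytes; the overhead factor (an
    assumed ~20%) stands in for the KV cache and compute buffers.
    """
    weights_gb = n_params_billion * bits_per_weight / 8
    return weights_gb * overhead

# Q4_K_M averages roughly 4.5 bits per weight across layer types:
print(f"7B  @ Q4_K_M: ~{estimate_memory_gb(7, 4.5):.1f} GB")
print(f"70B @ Q4_K_M: ~{estimate_memory_gb(70, 4.5):.1f} GB")
```

For a 7B model at Q4_K_M this lands around 4.7 GB, in the same ballpark as the 5–6 GB row in the table; actual usage grows with context length.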
Installation
Option A: Install via pip (Recommended)
The easiest way to install llama.cpp’s Python bindings and server:
pip install llama-cpp-python
With GPU acceleration (CUDA):
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
With Apple Metal support:
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
Option B: Build from Source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
CPU-only build:
cmake -B build
cmake --build build --config Release
With CUDA support:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
With Apple Metal support:
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release
Verify installation
./build/bin/llama-cli --version
Obtaining GGUF Models
graph TD
A["Need a GGUF model"] --> B{"Source?"}
B -->|"Pre-quantized"| C["Download from<br/>HuggingFace"]
B -->|"Own model"| D["Convert HF model<br/>to GGUF (F16)"]
D --> E["Quantize to<br/>Q4_K_M / Q5_K_M"]
C --> F["Ready for<br/>inference"]
E --> F
style A fill:#f8f9fa,stroke:#333
style C fill:#56cc9d,stroke:#333,color:#fff
style D fill:#ffce67,stroke:#333
style E fill:#ffce67,stroke:#333
style F fill:#56cc9d,stroke:#333,color:#fff
Download Pre-quantized Models from HuggingFace
Many models are available pre-quantized in GGUF format:
# Using huggingface-cli
pip install huggingface_hub
huggingface-cli download unsloth/Qwen3-0.6B-GGUF Q4_K_M.gguf --local-dir ./models
Quantize a Model Yourself
If you have a model in HuggingFace (safetensors) format, convert and quantize it:
# Step 1: Convert to GGUF (F16)
python convert_hf_to_gguf.py ./hf_model_dir --outfile model-f16.gguf --outtype f16
# Step 2: Quantize to Q4_K_M
./build/bin/llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
Common Quantization Levels
| Quantization | Bits | Size (7B) | Quality | Speed |
|---|---|---|---|---|
| Q2_K | 2 | ~2.7 GB | Low | Fastest |
| Q4_K_M | 4 | ~4.1 GB | Good | Fast |
| Q5_K_M | 5 | ~4.8 GB | Very Good | Medium |
| Q6_K | 6 | ~5.5 GB | Excellent | Medium |
| Q8_0 | 8 | ~7.2 GB | Near-FP16 | Slower |
| F16 | 16 | ~13.5 GB | Lossless | Slowest |
Recommendation: Q4_K_M offers the best balance of quality, size, and speed for most use cases.
Offline Inference (CLI)
Use llama.cpp for fast inference directly from the command line.
Basic Text Generation
./build/bin/llama-cli \
-m ./models/Q4_K_M.gguf \
-p "Explain machine learning in simple terms." \
-n 256 \
--temp 0.7
Interactive Chat Mode
./build/bin/llama-cli \
-m ./models/Q4_K_M.gguf \
--chat-template chatml \
-cnv \
--temp 0.7
With GPU Offloading
Offload layers to GPU for faster inference (use -ngl to specify number of layers):
./build/bin/llama-cli \
-m ./models/Q4_K_M.gguf \
-p "What is the difference between AI and ML?" \
-n 256 \
-ngl 99 \
--temp 0.7
-ngl 99 offloads all layers to GPU. Use a lower number for partial offloading when VRAM is limited.
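When VRAM is limited, a quick way to pick a partial -ngl value is to assume each layer costs an equal share of the GGUF file size. This is a sizing heuristic, not how llama.cpp itself accounts for memory (offloaded layers also carry KV-cache cost), so treat the result as a starting point:

```python
def estimate_ngl(model_file_gb, n_layers, vram_gb, reserve_gb=1.5):
    """Rough -ngl value: assume layers cost equal shares of the file size.

    reserve_gb leaves headroom for the KV cache and compute buffers;
    the 1.5 GB default is an assumption, tune it for your setup.
    """
    per_layer_gb = model_file_gb / n_layers
    fit = int((vram_gb - reserve_gb) / per_layer_gb)
    return max(0, min(n_layers, fit))

# A 7B Q4_K_M file (~4.1 GB, 32 transformer layers) on a 4 GB GPU:
print(estimate_ngl(4.1, 32, 4.0))
```

Start with the estimated value, watch for out-of-memory errors, and adjust down if the context length pushes usage over budget.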
Batch Inference with Python
from llama_cpp import Llama
llm = Llama(
model_path="./models/Q4_K_M.gguf",
n_ctx=4096,
n_gpu_layers=-1, # -1 = offload all layers to GPU
)
prompts = [
"Explain machine learning in simple terms.",
"What is the difference between AI and ML?",
"Write a Python function to reverse a string.",
]
for prompt in prompts:
output = llm(
prompt,
max_tokens=256,
temperature=0.7,
)
print(output["choices"][0]["text"])
    print("---")
Chat-style Inference with Python
from llama_cpp import Llama
llm = Llama(
model_path="./models/Q4_K_M.gguf",
n_ctx=4096,
n_gpu_layers=-1,
chat_format="chatml",
)
messages = [
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "Explain llama.cpp in simple terms."},
]
output = llm.create_chat_completion(
messages=messages,
max_tokens=256,
temperature=0.7,
)
print(output["choices"][0]["message"]["content"])
Serving with OpenAI-Compatible API
llama.cpp includes a built-in HTTP server that is fully compatible with the OpenAI API format.
graph TD
A["llama-server"] --> B["Load GGUF model"]
B --> C["OpenAI-compatible<br/>/v1/chat/completions"]
C --> D["curl"]
C --> E["Python requests"]
C --> F["OpenAI Python client"]
G["Options"] --> H["-ngl: GPU layers"]
G --> I["-c: Context length"]
G --> J["--slots: Parallelism"]
style A fill:#56cc9d,stroke:#333,color:#fff
style B fill:#6cc3d5,stroke:#333,color:#fff
style C fill:#6cc3d5,stroke:#333,color:#fff
style D fill:#f8f9fa,stroke:#333
style E fill:#f8f9fa,stroke:#333
style F fill:#f8f9fa,stroke:#333
style G fill:#ffce67,stroke:#333
Start the Server (C++ binary)
./build/bin/llama-server \
-m ./models/Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8000 \
--api-key your-secret-key \
--chat-template chatml \
-ngl 99 \
-c 4096 \
--slots 4
Key options explained:
- -m: Path to the GGUF model file
- -ngl 99: Number of layers to offload to GPU (99 = all layers)
- -c 4096: Context length (max tokens for input + output)
- --slots 4: Number of concurrent request slots (controls parallelism)
- --chat-template: Chat template format (chatml, llama2, mistral, etc.)
- --api-key: API key for authentication
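Model loading can take seconds to minutes, so client scripts should wait for the server to come up before sending requests. Recent llama-server builds expose a /health endpoint that returns 200 once the model is loaded (503 while loading); a small stdlib-only polling helper, assuming that endpoint:

```python
import time
import urllib.request

def wait_until_ready(url="http://localhost:8000/health", timeout_s=120):
    """Poll the server's health endpoint until the model has loaded."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:  # 503 means still loading
                    return True
        except OSError:
            pass  # connection refused or 503: not ready yet
        time.sleep(1)
    return False
```

Call wait_until_ready() right after launching the server, then proceed with the API calls below.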
Start the Server (Python)
python -m llama_cpp.server \
--model ./models/Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8000 \
--n_gpu_layers -1 \
--chat_format chatml \
    --n_ctx 4096
Verify the Server
curl http://localhost:8000/v1/models
Querying the API
Using curl
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-secret-key" \
-d '{
"model": "default",
"messages": [
{"role": "user", "content": "What is llama.cpp?"}
],
"temperature": 0.7,
"max_tokens": 256
  }'
Using Python (requests)
import requests
response = requests.post(
"http://localhost:8000/v1/chat/completions",
headers={"Authorization": "Bearer your-secret-key"},
json={
"model": "default",
"messages": [
{"role": "user", "content": "Explain GGUF quantization."}
],
"temperature": 0.7,
"max_tokens": 256,
}
)
print(response.json()["choices"][0]["message"]["content"])
Using OpenAI Python Client (Recommended)
Since the llama.cpp server is OpenAI-compatible, you can use the official OpenAI client:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="your-secret-key",
)
response = client.chat.completions.create(
model="default",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is quantization in LLMs?"},
],
temperature=0.7,
max_tokens=256,
)
print(response.choices[0].message.content)
Serving Custom / Fine-tuned Models
If you fine-tuned a small LLM with Unsloth and exported it to GGUF (e.g., gguf_model_small), here is how to serve it with llama.cpp.
graph TD
A["Fine-tuned model"] --> B{"Format?"}
B -->|"Already GGUF"| C["Serve directly<br/>with llama-server"]
B -->|"HF safetensors"| D["Convert to GGUF<br/>(convert_hf_to_gguf.py)"]
B -->|"LoRA adapter"| E["Serve with<br/>--lora flag"]
D --> F["Quantize<br/>(llama-quantize)"]
F --> C
C --> G["OpenAI-compatible API"]
E --> G
style A fill:#f8f9fa,stroke:#333
style C fill:#56cc9d,stroke:#333,color:#fff
style D fill:#ffce67,stroke:#333
style F fill:#ffce67,stroke:#333
style E fill:#6cc3d5,stroke:#333,color:#fff
style G fill:#56cc9d,stroke:#333,color:#fff
Option A: Serve a GGUF File Directly
Step 1: Prepare Your GGUF Model
After fine-tuning with Unsloth and exporting to GGUF, you should have a file like:
gguf_model_small/
├── added_tokens.json
├── chat_template.jinja
├── config.json
├── generation_config.json
├── merges.txt
├── model.safetensors
├── special_tokens_map.json
├── tokenizer.json
├── tokenizer_config.json
├── unsloth.BF16.gguf
├── unsloth.Q4_K_M.gguf
└── vocab.json
Step 2: Serve with llama.cpp
./build/bin/llama-server \
-m ./gguf_model_small/unsloth.Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8000 \
--api-key your-secret-key \
--chat-template chatml \
-ngl 99 \
-c 2048 \
--slots 4
Step 3: Verify and Query
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer your-secret-key"

from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="your-secret-key",
)
response = client.chat.completions.create(
model="default",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is fine-tuning?"},
],
temperature=0.7,
max_tokens=256,
)
print(response.choices[0].message.content)
Option B: Convert from HuggingFace Format
If your model is in HuggingFace safetensors format, convert it first:
# Convert to GGUF
python convert_hf_to_gguf.py ./hf_model_small --outfile my-model-f16.gguf --outtype f16
# Quantize
./build/bin/llama-quantize my-model-f16.gguf my-model-Q4_K_M.gguf Q4_K_M
# Serve
./build/bin/llama-server \
-m my-model-Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8000 \
--api-key your-secret-key \
--chat-template chatml \
-ngl 99 \
  -c 2048
Serve with LoRA Adapters
llama.cpp supports applying LoRA adapters at inference time without merging:
./build/bin/llama-server \
-m ./models/base-model-Q4_K_M.gguf \
--lora ./lora-adapter.gguf \
--host 0.0.0.0 \
--port 8000 \
--api-key your-secret-key \
  -ngl 99
Then query normally:
response = client.chat.completions.create(
model="default",
messages=[{"role": "user", "content": "Hello!"}],
)
Docker Deployment
Deploy llama.cpp in a container for production environments.
graph LR
subgraph cpu["CPU Deployment"]
A1["llama.cpp:server"] --> B1["Mount models<br/>volume"]
B1 --> C1["Expose port 8000"]
end
subgraph gpu["GPU Deployment"]
A2["llama.cpp:server-cuda"] --> B2["--gpus all<br/>+ mount models"]
B2 --> C2["Expose port 8000"]
end
style cpu fill:#6cc3d5,stroke:#333,color:#fff
style gpu fill:#56cc9d,stroke:#333,color:#fff
Run with Docker (CPU)
docker run -p 8000:8000 \
-v ./models:/models \
ghcr.io/ggerganov/llama.cpp:server \
-m /models/Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8000 \
  -c 4096
Run with Docker (CUDA GPU)
docker run --gpus all -p 8000:8000 \
-v ./models:/models \
ghcr.io/ggerganov/llama.cpp:server-cuda \
-m /models/Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8000 \
-ngl 99 \
  -c 4096
Docker Compose
version: '3.8'
services:
llama-server:
image: ghcr.io/ggerganov/llama.cpp:server-cuda
ports:
- "8000:8000"
volumes:
- ./models:/models
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
command: >
-m /models/Q4_K_M.gguf
--host 0.0.0.0
--port 8000
-ngl 99
-c 4096
      --slots 4
Performance Optimization Tips
- GPU offloading: Use -ngl 99 to offload all model layers to GPU; use a lower value for partial offloading when VRAM is limited
- Context length: Set -c to the minimum context length you need; larger context uses more memory
- Concurrent slots: Use --slots N to control how many requests can be processed in parallel
- Quantization choice: Q4_K_M is the sweet spot for most use cases; use Q5_K_M or Q6_K if quality matters more than speed
- Memory mapping: llama.cpp uses mmap by default, allowing models to be loaded without duplicating them in RAM
- Batch size: Use -b to tune the prompt-processing batch size (default 2048); larger values can speed up prompt processing
- Flash attention: Use --flash-attn to enable Flash Attention for faster inference (if supported)
- Streaming: Use stream=True for real-time token generation
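On the streaming point: with stream=True the server emits OpenAI-style server-sent events, one data: {...} JSON line per token. A stdlib-only sketch of a streaming client; the URL and API key match the server examples above, and the chunk shape follows the OpenAI streaming format, which llama-server mirrors:

```python
import json
import urllib.request

def parse_sse_line(line):
    """Extract the token text from one 'data: {...}' SSE line, or None."""
    if not line.startswith("data: ") or line.endswith("[DONE]"):
        return None
    chunk = json.loads(line[len("data: "):])
    return chunk["choices"][0]["delta"].get("content")

def stream_chat(prompt, url="http://localhost:8000/v1/chat/completions",
                api_key="your-secret-key"):
    """Yield tokens from a streaming chat completion as they arrive."""
    payload = json.dumps({
        "model": "default",
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode()
    req = urllib.request.Request(url, data=payload, headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    })
    with urllib.request.urlopen(req) as resp:
        for raw in resp:  # server sends one SSE line per token
            token = parse_sse_line(raw.decode("utf-8").strip())
            if token:
                yield token

# The parser can be exercised offline against a sample chunk:
print(parse_sse_line('data: {"choices":[{"delta":{"content":"Hi"}}]}'))
```

Against a running server you would iterate the generator, e.g. for token in stream_chat("Hello"): print(token, end="", flush=True). The official OpenAI client shown earlier also supports stream=True if you prefer it over raw SSE.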
llama.cpp vs Other Serving Solutions
| Feature | llama.cpp | vLLM | Ollama | TGI |
|---|---|---|---|---|
| CPU Inference | Excellent | No | Good | No |
| GPU Inference | Good | Excellent | Good | Excellent |
| Throughput | Low-Medium | Very High | Medium | High |
| GPU Required | No | Yes | No | Yes |
| OpenAI API | Yes | Yes | Partial | Yes |
| Multi-GPU | Limited | Yes | No | Yes |
| Quantization | Extensive (GGUF) | AWQ/GPTQ | GGUF | AWQ/GPTQ |
| Ease of Use | Medium | Medium | Easy | Medium |
| Best For | Edge/CPU/Local | Production GPU | Local Dev | Production GPU |
Conclusion
llama.cpp is the go-to solution for flexible, hardware-efficient LLM deployment:
- Runs on CPU, GPU, Apple Silicon, and hybrid configurations
- Supports extensive quantization for reduced memory and faster inference
- Provides an OpenAI-compatible API server for seamless integration
- Handles custom and fine-tuned models in GGUF format
- Deploys easily with Docker or as a single binary
This workflow is perfect for:
- Local development and prototyping
- Edge and embedded AI deployments
- Cost-sensitive production environments
- CPU-only server deployments
- Laptop and desktop AI applications
Read More
- Combine with a RAG pipeline (LangChain + llama.cpp)
- Add load balancing with Nginx or Traefik
- Deploy on Kubernetes with mixed CPU/GPU node pools
- Monitor with Prometheus + Grafana (llama.cpp exposes /metrics)
- Explore speculative decoding for faster inference
- Use grammar-constrained generation for structured output (JSON mode)